
Analysis of the Relation between NYTimes Coronavirus Articles & Public Opinion Polls

Introduction

While broadsheets decline as social media such as Facebook and Twitter grow in popularity, and influencers on Instagram or YouTube command millions of followers, whether the mainstream media can still sway the public remains an open question. This project therefore studies New York Times coronavirus articles alongside public opinion polls, to get a rough idea of whether there is any relation between a broadsheet's coverage and the opinions of the general public.

Sentiment analysis extracts from textual data people's perception of a particular issue, brand, or scheme, classifying it as positive, negative, or neutral. It has a wide range of applications, from brand monitoring and product-review analysis to policymaking.
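
As a minimal illustration of the dictionary-based scoring used later (the AFINN lexicon assigns each word an integer from -5 to 5), a sentence can be scored by summing the values of its words that appear in a lexicon. The three-word lexicon below is invented for the example:

```r
# Toy lexicon mapping words to sentiment values (AFINN-style integers)
lexicon = c("outbreak" = -2, "recovery" = 3, "crisis" = -3)
words   = c("the", "outbreak", "slows", "as", "recovery", "begins")

# Sum the values of the words that appear in the lexicon
score = sum(lexicon[intersect(words, names(lexicon))])
score  # -2 + 3 = 1
```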

For this project, sentiment analysis and a simple count of newspaper coverage of the relevant topics are applied to the articles returned by several sets of query words.

Dataset

The dataset has two parts. One is the NYT search result, cleaned into a four-column schema: created_time, headline, web_url, and content. The other is poll data collected from FiveThirtyEight, a website focused on opinion-poll analysis whose GitHub repositories aggregate exactly this kind of data, so the raw files can be used directly.

For the NYT articles, the keywords are limited to four prefixes (coronavirus, Covid-19, pandemic, and epidemic) and three suffixes (Trump, infection, and economy). The latter three are among the most discussed aspects of the epidemic in the U.S. and worldwide, giving 12 combinations in total. Because the NYT Developer API caps each query at about 2,000 results, the date range is restricted to the three months from February to April 2020, the document type to articles, and the results are sorted by relevance.
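
The 12 query-word combinations can be generated mechanically; a small sketch:

```r
# 4 pandemic terms x 3 topic terms = 12 query strings
prefixes = c("coronavirus", "Covid-19", "pandemic", "epidemic")
suffixes = c("Trump", "infection", "economy")
queries  = as.vector(outer(prefixes, suffixes, paste))
length(queries)  # 12
```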

#################################################################################
####            function - search news article with API                      ####
# requires jsonlite (fromJSON) and an NYT API key stored in api.key.nytimes
nytime = function (keyword) {
  searchQ = URLencode(keyword)
  url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?q=',searchQ,
              '&begin_date=20200201&end_date=20200430&sort=relevance&fq=document_type:"article"&api-key=',api.key.nytimes,sep="")
  #get the total number of search results
  initialsearch = fromJSON(url,flatten = T)
  maxPages = ceiling(initialsearch$response$meta$hits / 10) - 1  # 10 results per page
  
  #try with the max page limit at 200
  maxPages = ifelse(maxPages >= 199, 199, maxPages)
  
  #create an empty data frame
  df = data.frame(created_time = character(), headline = character(), web_url = character())
  
  #save search results into data frame
  for(i in 0:maxPages){
    #get the search results of each page
    tryCatch({
      nytSearch = fromJSON(paste0(url, "&page=", i), flatten = T) 
      temp = data.frame(created_time = nytSearch$response$docs$pub_date,
                        headline = nytSearch$response$docs$headline.main,
                        web_url = nytSearch$response$docs$web_url)
      df = rbind(df, temp)
      Sys.sleep(6) #sleep for 6 seconds to stay under the API rate limit
    }, error = function(e) return(NA_character_)) #skip pages that fail
    
  }  
  return(df)
}
# retrieve full article body by scraping each URL
# requires xml2, rvest, and magrittr (for %>%)
body_text <- function(url_list) {
  content_list = rep(NA, NROW(url_list))
  for(i in 1:NROW(url_list)){
    content_list[i] = tryCatch({
      xml2::read_html(url_list[i]) %>%
        rvest::html_nodes(".StoryBodyCompanionColumn p") %>%
        rvest::html_text() %>%
        paste(collapse = "\n")
    }, error = function(e) return(NA_character_))
  }

  return(content_list)
}
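
A hedged usage sketch of the two helpers (api.key.nytimes must hold a valid key, and the NYT_trump variable name is an assumption): the search URL built inside nytime() percent-encodes the keyword, so a combined query such as "coronavirus Trump" is safe to pass.

```r
# How the helpers would be combined (requires a valid NYT API key):
# NYT_trump         = nytime("coronavirus Trump")
# NYT_trump$content = body_text(NYT_trump$web_url)

# The URL that nytime() builds for one keyword (key redacted here):
searchQ = URLencode("coronavirus Trump")
url = paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=", searchQ,
             "&begin_date=20200201&end_date=20200430&sort=relevance",
             '&fq=document_type:"article"&api-key=', "YOUR_KEY")
```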

The next step is to clean and wrangle the search results: removing duplicate rows and rows with missing values, and converting the created_time and content columns to the Date and character types. The first table below shows the NYT search results for Trump, 2,606 articles in total; the Infection query returns 2,937 results and the Economy query 2,867.

## # A tibble: 6 x 4
##   created_time headline             web_url             content            
##   <date>       <chr>                <chr>               <chr>              
## 1 2020-03-17   Citing Coronavirus,… https://www.nytime… "WASHINGTON — The …
## 2 2020-04-25   Coronavirus and the… https://www.nytime… "In an interview w…
## 3 2020-04-29   Coronavirus Casts U… https://www.nytime… "WASHINGTON — Two …
## 4 2020-04-27   In Kayleigh McEnany… https://www.nytime… "WASHINGTON — Kayl…
## 5 2020-04-28   Trump Vows More Cor… https://www.nytime… "WASHINGTON — Pres…
## 6 2020-04-30   Trump’s Disinfectan… https://www.nytime… "SAN FRANCISCO — M…
## # A tibble: 6 x 3
##   created_time headline                       web_url                      
##   <date>       <chr>                          <chr>                        
## 1 2020-03-15   A Complete List of Trump’s At… https://www.nytimes.com/2020…
## 2 2020-03-06   As Coronavirus Spreads, How Y… https://www.nytimes.com/2020…
## 3 2020-03-04   Nursing Homes Are Starkly Vul… https://www.nytimes.com/2020…
## 4 2020-03-02   Outbreak Strikes Seattle Area… https://www.nytimes.com/2020…
## 5 2020-02-27   Iran Vice President Is One of… https://www.nytimes.com/2020…
## 6 2020-02-27   Germany Tries to Solve a Coro… https://www.nytimes.com/2020…
## # A tibble: 6 x 3
##   created_time headline                      web_url                       
##   <date>       <chr>                         <chr>                         
## 1 2020-03-19   Witnessing the Birth of the … https://www.nytimes.com/2020/…
## 2 2020-03-26   The Coronavirus Economy: Whe… https://www.nytimes.com/2020/…
## 3 2020-04-29   Wall Street Rallies as the F… https://www.nytimes.com/2020/…
## 4 2020-04-29   Fed Suggests Tough Road Ahea… https://www.nytimes.com/2020/…
## 5 2020-04-29   Fed Weighs Next Steps to For… https://www.nytimes.com/2020/…
## 6 2020-04-26   There’s Really Only One Way … https://www.nytimes.com/2020/…
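
The cleaning steps just described (dropping duplicates and missing values, coercing types) can be sketched on a toy frame; raw_df and its rows are invented for the example:

```r
library(dplyr)

# Invented rows mimicking the raw search-result schema
raw_df = data.frame(
  created_time = c("2020-03-17T10:00:00Z", "2020-03-17T10:00:00Z", "2020-04-25T09:00:00Z"),
  headline     = c("A", "A", "B"),
  web_url      = c("u1", "u1", "u2"),
  content      = c("text one", "text one", NA),
  stringsAsFactors = FALSE
)

clean_df = raw_df %>%
  distinct() %>%                                 # drop duplicate rows
  filter(!is.na(content)) %>%                    # drop rows with missing body text
  mutate(created_time = as.Date(created_time),   # ISO timestamp -> Date
         content      = as.character(content))

nrow(clean_df)  # 1
```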

Meanwhile, for the opinion poll data, FiveThirtyEight collects full pollster and sample information and computes approval and disapproval rates for the subjects of Trump, Infection, and Economy. The raw data is shown below. Columns such as startdate, enddate, approve_adjusted, and disapprove_adjusted are selected for further processing, and the final table looks like a key-value pair, {average_date: [approve_adjusted, disapprove_adjusted]}. For the concern-over-economy-or-infection poll data, there are likewise startdate and enddate columns, and the four degrees of concern, from very and somewhat to not_very and not_at_all, are also included in the processed table.

## # A tibble: 6 x 19
##   subject modeldate party startdate enddate pollster grade samplesize
##   <chr>   <chr>     <chr> <chr>     <chr>   <chr>    <chr>      <dbl>
## 1 Trump   5/4/2020  D     2/2/2020  2/4/20… YouGov   B-           523
## 2 Trump   5/4/2020  D     2/7/2020  2/9/20… Morning… B/C          817
## 3 Trump   5/4/2020  D     2/9/2020  2/11/2… YouGov   B-           510
## 4 Trump   5/4/2020  D     2/16/2020 2/18/2… YouGov   B-           529
## 5 Trump   5/4/2020  D     2/23/2020 2/25/2… YouGov   B-           525
## 6 Trump   5/4/2020  D     2/24/2020 2/26/2… Morning… B/C          786
## # … with 11 more variables: population <chr>, weight <dbl>,
## #   influence <dbl>, multiversions <chr>, tracking <lgl>, approve <dbl>,
## #   disapprove <dbl>, approve_adjusted <dbl>, disapprove_adjusted <dbl>,
## #   timestamp <chr>, url <chr>
## # A tibble: 6 x 23
##   subject modeldate party startdate enddate pollster grade samplesize
##   <chr>   <chr>     <chr> <chr>     <chr>   <chr>    <chr>      <dbl>
## 1 concer… 5/4/2020  all   1/27/2020 1/29/2… Morning… B/C         2202
## 2 concer… 5/4/2020  all   1/31/2020 2/2/20… Morning… B/C         2202
## 3 concer… 5/4/2020  all   2/7/2020  2/9/20… Morning… B/C         2200
## 4 concer… 5/4/2020  all   2/13/2020 2/18/2… Kaiser … <NA>        1207
## 5 concer… 5/4/2020  all   2/24/2020 2/26/2… Morning… B/C         2200
## 6 concer… 5/4/2020  all   2/27/2020 2/27/2… SurveyM… <NA>        1051
## # … with 15 more variables: population <chr>, weight <dbl>,
## #   influence <dbl>, multiversions <lgl>, tracking <lgl>, very <dbl>,
## #   somewhat <dbl>, not_very <dbl>, not_at_all <dbl>, very_adjusted <dbl>,
## #   somewhat_adjusted <dbl>, not_very_adjusted <dbl>,
## #   not_at_all_adjusted <dbl>, timestamp <chr>, url <chr>
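The approval-poll processing described above, keying the mean of startdate and enddate to the adjusted rates, might look like the following sketch on toy rows (column names follow the FiveThirtyEight raw data; the values are invented):

```r
library(dplyr)

# Invented approval rows mimicking the FiveThirtyEight columns used here
polls = data.frame(
  startdate = c("2/2/2020", "2/2/2020"),
  enddate   = c("2/4/2020", "2/6/2020"),
  approve_adjusted    = c(42, 44),
  disapprove_adjusted = c(53, 51),
  stringsAsFactors = FALSE
)

approval = polls %>%
  mutate(startdate = as.Date(startdate, format = "%m/%d/%Y"),
         enddate   = as.Date(enddate,   format = "%m/%d/%Y")) %>%
  rowwise() %>%
  mutate(average_date = mean.Date(c(startdate, enddate))) %>%  # midpoint of field dates
  ungroup() %>%
  group_by(average_date) %>%
  summarize(approve    = mean(approve_adjusted),
            disapprove = mean(disapprove_adjusted))
```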
With the corpora and poll tables prepared, each article's AFINN sentiment is summed per URL and then averaged per publication day:

# requires dplyr, tidytext (unnest_tokens, get_stopwords), and textdata (lexicon_afinn)
melted_corpus = function(complete_corpus) {
  sentiment_by_day = complete_corpus %>%
    select(web_url, content) %>%
    unnest_tokens(word, content) %>%
    anti_join(get_stopwords(), by = "word") %>%
    inner_join(lexicon_afinn(), by = "word") %>%
    group_by(web_url) %>%
    summarize(sentiment = sum(value)) %>%
    left_join(complete_corpus, by = "web_url") %>%
    select(created_time, sentiment) %>%
    group_by(date = created_time) %>%
    summarize(sentiment = mean(sentiment), n = n())
  
  return(sentiment_by_day)
}
df %>% melted_corpus() %>% ggplot(aes(x = date, y = sentiment)) +
                                    geom_bar(stat = "identity", position = "identity")

infection_report = NYT_infection %>% mutate(date = as.Date(created_time)) %>%
  group_by(date) %>% summarize(n = n()) %>%
  ggplot(aes(date, n)) + geom_bar(stat = "identity", alpha = 0.5) +
  facet_wrap(~"Report Number on Infection")
concern_infection = concern_polls %>%
  filter(subject == "concern-infected") %>%
  mutate(startdate = as.Date(startdate, format = "%m/%d/%Y"), enddate = as.Date(enddate, format = "%m/%d/%Y")) %>%
  rowwise() %>% mutate(meandate = as.character(mean.Date(c(startdate, enddate)))) %>%
  select(meandate, very = very_adjusted, somewhat = somewhat_adjusted, not_very = not_very_adjusted, not_at_all = not_at_all_adjusted) %>%
  group_by(meandate) %>%
  summarize(very = mean(very), somewhat = mean(somewhat), not_very = mean(not_very), not_at_all = mean(not_at_all)) %>%
  mutate(date = as.Date(meandate)) %>%
  filter(date <= "2020-04-30", date >= "2020-02-01") %>%
  gather(degree, Percentage, very: not_at_all) %>%
  select(date, degree, Percentage) %>%
  ggplot(aes(date, Percentage, color = degree)) +
  geom_point() + geom_smooth(method = "loess") +
  facet_wrap(~"Concern over Infection") +
  theme(panel.grid.major = element_line(colour = 'transparent'),
        plot.title = element_text(hjust = 0.5),
        axis.title.x=element_blank(),
        legend.title=element_blank(),
        panel.background=element_blank(),
        panel.border=element_blank(),
        panel.grid.minor=element_blank(),
        plot.background=element_blank())

ggplotly(concern_infection)
#subplot(ggplotly(concern_infection), ggplotly(infection_report), nrows = 2, margin = 0.04, heights = c(0.6, 0.4), titleX = TRUE, titleY = TRUE, which_layout = 1)